METHOD FOR SUGGESTION OF CONTENT EXTRACTED FROM A SET OF INFORMATION SOURCES
Patent abstract:
The invention relates to a method of suggesting content extracted from a set of information sources, comprising steps of: extracting contents from a set of information sources according to keywords determined by a curator user and enriched by keywords suggested by a suggestion engine; selecting the content to be published; and modifying the suggestion engine parameters based on the actions and reactions of the users to the published contents.

Publication number: FR3043816A1
Application number: FR1661104
Filing date: 2016-11-16
Publication date: 2017-05-19
Inventor: Marc Rougier
Applicant: Scoop It
Priority application: FR1561003 (filed 2015-11-16, published as FR3043817A1)
Also published as: FR3043816B1 (granted 2019-06-14); US20170139939A1
Patent description:
TECHNICAL FIELD

The present invention relates to the field of information-processing methods, and particularly to search engines. It relates more particularly to a method of suggesting content extracted from a set of information sources.

STATE OF THE ART

The extraction of data from a very large volume of data is generally referred to by the generic English term "big data". An example is searching for information on a predetermined subject. When the data are press articles, and the purpose is to group such articles around a given theme, for example to offer them for reading to an audience interested in this field, one speaks of content curation. Every day, millions of new web pages are published, relating to countless topics. Reading all these pages is naturally impossible for a human reader interested in a particular subject that is covered, directly or indirectly, by some of the newly published pages, through text, photos, tables, etc. The need for content curation, i.e. an intelligent data-compilation interface between the web and its readers, results from this observation. The curation of data poses several problems, including speed of execution, quality of the selected results, and highlighting of these results so that readers can easily find them amid the background noise of newly published pages.

STATEMENT OF THE INVENTION

The present invention aims to solve some of the problems mentioned above. In particular, it aims at a method of curating content that is more effective in terms of the relevance of the content suggestions. Advantageously, the method is also more effective in terms of the visibility of the retained content on previously selected publication sites. In a first aspect, the invention is directed to a method of suggesting content extracted from a set of information sources. The method comprises the following steps: 301, determination by a curator user of the keywords and/or search sources to be used, thereby defining search criteria of the user; 302, enrichment of the user's search criteria with keywords and/or sources suggested by a suggestion engine, thereby defining a search strategy; extraction of at least one content from the search strategy's set of sources according to the keywords of the search strategy; sorting of the contents according to a distance from the user's search criteria; display of the sorted contents to the curator user; 303, selection by the curator user of the content deemed relevant; 304, publication of the content selected by the curator user to reader users; 305, recording of the reactions of curator users and reader users to each published content; 306, automatic modification of the suggestion parameters of the suggestion engine based on the engine's analysis of the recorded reactions, thereby creating a feedback and learning loop of said suggestion engine. In this way, the suggestion engine is enriched automatically by the actions of curator users and/or reader users. The suggestions proposed by the suggestion engine are thus improved and more relevant, which facilitates the curation task of the curator users. In a particular mode of implementation, the method also comprises a step 308 of determining one or more publication sites, and publication dates, according to previously determined criteria for maximizing the visibility of the publication thus produced.
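The feedback loop of steps 301 to 306 can be summarized in a few lines of Python. The sketch below is illustrative only: the class and function names, the weight-update rule and the 0.5 threshold are assumptions, not elements of the patent.

```python
# Minimal sketch of the curation loop (steps 301-306); names are hypothetical.
class SuggestionEngine:
    def __init__(self):
        self.keyword_weights = {}  # learned weight per suggested keyword

    def enrich(self, criteria):
        """Step 302: extend the curator's criteria into a search strategy."""
        suggested = [kw for kw, w in self.keyword_weights.items() if w > 0.5]
        return {"keywords": criteria["keywords"] + suggested,
                "sources": criteria["sources"]}

    def learn(self, feedback):
        """Step 306: adjust the suggestion parameters from recorded reactions."""
        for keyword, reaction in feedback:  # reaction assumed in [-1.0, 1.0]
            w = self.keyword_weights.get(keyword, 0.5)
            self.keyword_weights[keyword] = w + 0.1 * (reaction - w)

def curation_cycle(engine, criteria, extract, curator_select, publish, get_feedback):
    strategy = engine.enrich(criteria)        # step 302
    contents = extract(strategy)              # extraction and sorting
    selected = curator_select(contents)       # step 303
    publish(selected)                         # step 304
    engine.learn(get_feedback(selected))      # steps 305-306: feedback loop
```

Each cycle both serves the curator and refines the engine, which is what makes the suggestions progressively more relevant.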
In a particular mode of implementation, the method comprises: a step 406 of calculating a number of views for each page viewed by one, some, or all of the curator users and reader users; a step 410 of calculating a score for each content, according to the number of pages viewed and the type of action a user has performed on this content: follow 407, share 408, recommend 409; and a step 411 of score analysis to recommend and categorize the contents for qualification by the users. In a particular mode of implementation, the categorization is carried out using at least one automatic machine-learning algorithm. In a particular mode of implementation, the suggestion engine implements: a step 502, in which the keywords chosen in step 301 are used to browse data domains to extract URL addresses of pages relevant to these keywords; a step 501, in which the system retrieves the contents of the selected web pages and stores them in memory; a step 503, in which the system extracts from these web pages the texts, images and any associated RSS feed addresses; a step 504, in which these RSS feeds are stored and feed a database; a step 506 of looping over the RSS feed URLs to search for the RSS URLs corresponding to the predefined keywords; a step 507 of loading these RSS feeds; a step 514 in which, from the texts, images and RSS feeds extracted during step 503, the system indexes and stores suggestion elements, these suggestions feeding a suggestions database; a step 516, in which the system searches the suggestions database for the keywords selected by the user; a step 517 of filtering the data extracted by this search, to eliminate the pages already viewed by the curator user during the present search; a step 518 of applying other filters previously defined by said user; and a step 519, in which the suggestions are sorted according to predefined criteria. In a particular mode of implementation, in step 308, the system relies on the set of data already existing both at the level of the user and at the level of all curator users and/or reader users, to determine, for each audience/theme/publication-network triplet, a set of publication moments and intervals considered most effective according to a predetermined criterion. In a particular mode of implementation, the method furthermore comprises a step in which the system extracts, for each shared article, the number of reactions it generates, calculates a score for each share, and deduces from it the hours most conducive to sharing on each of the social networks. In a particular mode of implementation, the method comprises a step 605 in which the details of each share (date, time, content, destination, etc.) are stored, and a step 606 in which the system analyzes the impact obtained by the share made, so as to feed a database for machine learning. In a particular mode of implementation, step 306 implements an algorithm based on the behaviors and actions of all curator users and/or reader users, making it possible to highlight contents, to categorize them in thematic groups, and to connect users who have the same interests. In a particular mode of implementation, the method comprises: a step 409 of recording the actions of each reader user — reading a content, sharing a content, qualifying a content, or recommending a content — attached to said content in order to qualify it;
a step 410 of calculating a score for the content, on the basis of these qualification elements collected over time; a step 411 of comparing the score with predefined threshold values: i) if this score of the content is less than a first predetermined threshold value but greater than a second predetermined threshold value, a step 703 of calculating categories that are relevant for the content, a step 704 of proposing to the user to choose a category himself, this choice of the reader user being used to improve, in a step 705, the automatic category-selection engine, and a step 706 of using the content for recommendations; ii) if this score of the content is greater than the first threshold value, the content is used for recommendations and, if it has no categories attached, the system calculates, as before, categories that will come to describe it.

PRESENTATION OF FIGURES

The characteristics and advantages of the invention will be better appreciated from the description that follows, which sets out the characteristics of the invention through a non-limiting example of application. The description is based on the appended figures, which represent: Figure 1: a block diagram of the elements involved in the device; Figure 2: a schematic illustration of the operating environment of the proposed system; Figure 3: a flowchart of the steps of the curation process; Figure 4: a flowchart of the steps of the reading operation; Figure 5: a flowchart of the steps of the optimized content-suggestion method; Figure 6: a flowchart of the steps of the information-sharing and publication-planning process; Figure 7: a flowchart of the steps of the recommendation and classification process.

DETAILED DESCRIPTION OF AN EMBODIMENT OF THE INVENTION

The invention is intended to be implemented in software. As illustrated schematically in Figure 1, the method is implemented by one or more curator users 101 and reader users 106. Each of these curator users 101 and reader users 106 works on a computer 102, for example but not exclusively of the PC type. Each computer has means for implementing part of the method. Each computer 102 is connected, via a network 103 known per se, to various databases 104, as well as to at least one central server 105 on which software implementing another part of the method is executed. The function of the curator user 101 is to sort data and choose which ones are suitable for properly describing a predetermined subject, which corresponds to the current definition of subject curation. These curator users 101 can be human or algorithmic. In the case where these curator users 101 are algorithmic, a distance is defined between the chosen subject and the data associated with this subject. The process of "curation of the web" relies on software (web-based, for example, but not exclusively) to suggest content to users based on their interests, previously defined by various keywords and stored. The goal is to extract the essence of these contents and to recommend the most relevant ones according to the contents already accepted, i.e. already integrated in the topics of the various curator users 101. The essence of a document here designates the data particularly relevant for characterizing this document: e.g. title and subtitles or paragraph headings, keywords, author, date, photo, most frequently used words, etc. In the remainder of the description, a content is defined as a page of data of the web-page type, typically comprising texts, images, update-date tags, associated keywords, and so on.
A topic is defined as a set of data, for example in the form of web pages, images, texts, etc., belonging to the same semantic domain chosen by a user. Visibility is defined as the number of times that Internet users come to see a given topic. The purpose of the system is to let the curator user 101 increase the visibility of his topics on the web among reader users 106, positioning himself through the system as a specialist in a particular field. A reader user 106 is defined as a user who comes to read the content of the various topics that interest him. For this purpose, the system makes it possible to broadcast the selected content on the web through several channels: visibility on search engines, social networks, corporate sites of users, etc. The system makes it possible to keep the selection made, through a magazine gathering all the relevant contents on a single public page. The system offers online tools for content marketing: a marketing strategy that involves the creation and distribution, by a company, of media content to acquire new customers.

Operating environment of the system (Figure 2)

The system engine is a platform (defined as a set of services) that represents, that is, contains the references of, a very large number of web-page addresses. As an order of magnitude, the platform represents more than 50 million URLs. It is a curation system for editorial content and a community platform with a large audience. The architecture of the platform is based on the web architecture illustrated in Figure 2. As seen in this figure, the method is used in the context of an Internet-type data network 201. The system implements: a module 202 for protection against denial-of-service attacks, and a load-balancing module 203 ("IP load balancing" and "HTTP load balancing") between users. It furthermore comprises at least one web-browsing ("crawling") server 204 and at least one page-suggestion server 205, associated with a "big data" system 206 for storing web pages, that is, a database storing a very large volume of Internet pages. The suggestion servers 205, the protection module 202 and the load-balancing module 203 feed at least one application server 207 associated with an image server 208. The application server 207 implements a search engine 209. Furthermore, the image server 208 is connected to a big-data database 210 for storing images and to a database 211 for relational storage. References to the images are stored in a relational database to allow joins between the NoSQL systems and the relational stores. The application server 207 and the image server 208 are connected to a cache database 212. The application server 207 is connected to an event-storage database 214. This event-storage system can be seen as a log system usable for maintenance operations, or for internal or external needs (external meaning, for example, statistics for users). In addition to the application servers 207, a NoSQL server cluster 213 is used to store the unstructured data and to execute on these data various algorithms for recommendation, classification, statistical analysis, etc. Finally, the event-storage database 214 feeds at least one asynchronous task-calculation server 215, whose calculation tasks provide statistics to the user but also serve internal needs, and which also feeds the NoSQL storage database 213 with the results of the calculations.
The exploration of the web and the collection of meaningful data on web pages are based on programs written, for example, in Python and Java (trademarks), which make it possible to browse the visited pages and extract their essential information. All or part of the functions — Internet access 201, protection 202, load balancing 203, navigation servers 204, suggestion servers 205, Internet-page storage 206, application servers 207, image servers 208, search engine 209, database management 210, 211, 212, 213, 214, and asynchronous computing servers 215 — are executed by the central server 105 of Figure 1.

General operation of the curation process (Figure 3)

The suggestions proposed by the system implementing the method are derived from keywords and sources selected by a curator user 101. A suggestion is defined as an Internet page address containing relevant information relating to a previously chosen theme, the latter being defined for example by a set of keywords. A source is defined as the address of a web page or data server, for example but not limited to the present system. In the present exemplary implementation, the method uses a very large part, or even all, of the data (that is to say, pages, texts, images) stored or referenced by the other curator users 101 and reader users 106 to qualify, order and filter the suggested content. The suggestions sent to a curator user 101 thus come both from the keywords and sources given by this curator user 101 and from all the knowledge acquired by the system by analyzing the behavior of the other curator users 101 and reader users 106 with respect to contents (see Figures 4 and 7 and the associated descriptions). Figure 3 illustrates this operation. In a step 301, a curator user 101 determines keywords relating to a predetermined subject, or sources to be used for answering a search, and enters these data into the system implementing the method that is the subject of the present invention. The keywords and sources determined by the curator user 101 thus define search criteria of the user. In a step 302, a suggestion engine (hereinafter referred to as "suggestion engine 302" for simplicity), here, but not exclusively, implemented by a central server, enriches the search criteria with keywords and/or sources suggested by the suggestion engine 302, thereby defining a search strategy. It should be emphasized that the parameters of the suggestion engine 302 are preferably specific to each curator user 101. In other words, there is preferably a distinct suggestion engine 302 per curator user 101. During this step 302, the suggestion engine 302 also sorts the data (information, articles) to which it has access, and determines for these data a distance from an ideal response to the user's search criteria. It then transmits to the curator user 101 the most relevant data (articles), sorted for example by increasing distance to the ideal response. The detail of step 302 is given in Figure 5. In a step 303, the curator user 101 analyzes these data (including, for example, various Internet pages) and determines, for each of them, in a step 310, whether it should be browsed and analyzed in more detail ("read"). If this is not the case, the curator user 101 proceeds to the next suggestion.
If this is the case, the suggestion is examined in detail ("read") and then evaluated in a step 309 to determine whether it meets a predetermined criterion relative to the initial search and therefore must be published ("curate"); this predetermined criterion may, for example, take into account the date of the data or its source. If the data meets the predetermined relevance criterion, in a step 304, the data is marked as publishable within a file relating to the initially chosen subject. Regardless of the classification of the data relative to the relevance criterion given by the curator user 101 in step 310 (to be published, not to be published, irrelevant), in a step 305, the data is associated with a qualification characterizing its relevance to the initial search, or supplementing its description with various keywords or grading notes. In a step 306, these complementary data-qualifying elements linked to the initial search are used to modify the setting parameters of the suggestion engine, thereby creating a feedback and learning loop of said suggestion engine. In a step 307, the system determines whether the data should be shared or not. If the data is to be published, in a step 308, the system determines one or more publication sites, and publication dates, according to predetermined criteria of maximum reach of the publication thus produced. The detail of step 307 is given in Figure 6. In order to be able to provide the curator user 101 with a large number of content suggestions relevant to his work topic, the system implementing the curation method described here must be able to discover, in real time, as large a proportion as possible of the articles corresponding to the interests of a user. Indeed, providing quality content is not enough; it must also be provided in real time, or as close as possible to it. There is therefore a problem of speed in collecting new information and extracting its useful part. Indeed, the system must refine the qualification of an article and thus extract from it the information useful for its qualification. The uncertainty in this area lies in the selection of the information necessary for the qualification of the article. For reasons of readability of the suggestions by the curator user 101, the system must be able to extract the essence of the article, defined here as an image associated with the article as well as an extract of text significant for the understanding of the article. The information-collection system does not collect the entire web but simply a subset of the web corresponding to the subjects handled by the users of the platform. However, the main difficulty lies in succeeding in extracting the parts necessary for a good subsequent analysis of the content. The idea is to provide system users with a maximum of relevant articles (and their associated data) captured on the web in real time. It is therefore also a question of extracting the semantic information from pages whose structure varies from site to site. The system must find a solution that is both generic (the diversity of data is large) and fast (the goal is to offer real-time articles to users). To find a maximum of content, two techniques are known. The first is the recursive data-mining/extraction technique, which consists of following the links connecting all the documents on the web.
The second is to use and multiply external services (in order to benefit from work already carried out) so as to extract only the a priori interesting content while guaranteeing a sufficient speed of execution. For the extraction of data from documents, existing solutions or tools implement only part of the needs: exhaustive exploration of web documents, or incomplete frameworks (sets of software components) for web exploration and semantization. Indeed, the system must be able to pre-select a subset of the web before exploring it, because exploring the entire web would be far too expensive in terms of resources and infrastructure. The inventors have therefore decided to develop their own solution, in an iterative way, to be as simple and fast as possible while remaining relevant. Several data-collection methods can be considered. The algorithms implemented in the process focus on a restricted area of the web. The system limits the exploration of the web to the domains of interest declared by the curator user 101 through keywords. These keywords make it possible, via various APIs, to select URLs constituting starting points for a content search on the web. In addition, the system has a large number of RSS feeds that can also serve as a base referential for web browsing. Indeed, these RSS feeds have been entered by all the curator users 101 and reader users 106, and the subset of the web they represent is therefore a reflection of the contents relevant to curator users 101 and reader users 106. The extraction of page contents and of the data essential for qualifying the contents has gone through several implementations: simple reading of the HTML metadata; and visual rendering of the pages in a pseudo-browser (Python and QtWebKit — registered trademarks) to find the main information. Rendering the web pages via QtWebKit makes it possible to render the page and thus to extract the content based on evolutive visual rules. These techniques have drawbacks: poverty of the extracted content, or the resources necessary for their implementation. The extraction of data is here based on an analysis of microformats data (http://en.wikipedia.org/wiki/Microformat) and in particular the OpenGraph protocol (http://oqp.me/). However, some information is not available, or some web pages do not provide it. In addition, the information extracted is not numerous enough. It is therefore desirable to implement new algorithms to extract the entire content of an article. The algorithm is mainly based on heuristics about the metadata and structure of the web page. The system uses a list of HTML elements and CSS classes that are often used to delineate the article within the web page. The algorithm therefore locates all the contents framed by these elements and classes. From this list of found contents, the method implements other heuristics to recognize the main article. For example, the process attaches importance to the size of the content found and to the place of this content in the structure of the HTML document. On the basis of a list of popular websites, the method automatically verifies that the algorithm manages to extract content in sufficient quantity, that is to say greater than a predetermined value. For the recovery of the impact of the analyzed pages on social networks, for each of the pages, the different social networks are queried to determine the number of "likes", "tweets", etc. of an article.
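A heuristic of the kind just described can be sketched as follows, assuming the BeautifulSoup library; the tag list, the class list and the scoring weights are illustrative assumptions, not the patented rules.

```python
# Sketch of a main-article heuristic: candidate containers scored by
# text size, known container classes, and depth in the document tree.
from bs4 import BeautifulSoup

CANDIDATE_TAGS = ["article", "main", "section", "div"]
CANDIDATE_CLASSES = {"article", "post", "content", "entry", "story"}

def extract_main_article(html):
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0.0
    for element in soup.find_all(CANDIDATE_TAGS):
        classes = set(element.get("class") or [])
        text = element.get_text(" ", strip=True)
        score = len(text)                       # favour long text blocks
        if classes & CANDIDATE_CLASSES:
            score *= 2                          # favour known container classes
        score /= 1 + len(list(element.parents)) # favour elements near the root
        if score > best_score:
            best, best_score = element, score
    return best.get_text(" ", strip=True) if best else ""
```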
Reading operation (Figure 4)

Once the contents have been extracted during the curation phase, they are published in the topics of the curator users 101 and are thus made available to the reader users 106 according to their areas of interest. In the present non-limiting example of implementation of the method, a reader user 106 discovers Internet content in a step 401, from several sources (at the top of the diagram of Figure 4). These content sources — social networks 402, search engines 403, followed content 404, recommendation or categorization 405, etc. — allow the user to discover content that is publishable in relation to his work topic. The contents seen in a step 406 make it possible to calculate a number of views for each page viewed by one, some, or all of the curator users 101 and reader users 106. The reader user 106 can perform several actions on these contents: follow 407; share 408 (in this case at least one share destination is selected by the user); recommend 409 (in this case, the content is marked as qualified). These actions, combined with the page views calculated in step 406, make it possible to calculate a score per content (step 410). Each of these actions gives value to the content on which the action is performed. These notations, combined with algorithms, make it possible to calculate a score for each content, as sketched below. This score is then analyzed by the system in a step 411 to recommend and categorize the contents for their browsing by the reader users 106 in step 405. The categorization is partly based on machine-learning algorithms. The reader user 106 also validates the choice of the category. This validation is subsequently used as input for the machine-learning algorithms. The details of steps 409, 410 and 411, which complete step 305 of Figure 3, are given in Figure 7, relating to interests and recommendations.
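The score of step 410 can be sketched as a weighted sum of views and actions. The patent only states that each action type contributes value to the content; the weights below are assumptions.

```python
# Sketch of the per-content score of step 410; weights are illustrative.
ACTION_WEIGHTS = {"follow": 2.0, "share": 3.0, "recommend": 5.0}

def content_score(views, actions):
    """views: page-view count (step 406); actions: list of action names."""
    score = views * 0.1
    for action in actions:
        score += ACTION_WEIGHTS.get(action, 0.0)
    return score

print(content_score(views=120, actions=["follow", "share", "recommend"]))
```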
How the suggestion engine works (Figure 5)

The information collected on the Internet, as well as the information selected by the curator users 101, must be filtered and classified in order to then be highlighted on the pages of the system. Thus the articles collected on the Internet must be analyzed, classified and filtered before subsequently being proposed to curator users 101. Indeed, the first level of curation is done by the suggestion engine 302, which must be able, from all the items retrieved daily from the Internet, to select those that must be offered to a given curator user 101. For this, several systems must be implemented: data analysis, classification and filtering, before these contents are proposed to the curator user 101. The execution time of the calculations must be short enough to allow new content to be proposed several times per day. In a first version, the suggestion engine 302 calls web APIs to find content to offer to the curator user 101. No pooling of the global information (i.e. from all users) of the platform is used. In a second version, using content-recommendation and classification algorithms, the experience gained makes it possible to qualify, filter and order the information collected on the web. Given the peculiarity of the content of the platform, which consists essentially of citations of pre-existing articles on the web, the system is interested in the recommendation of short content. The system platform offering social-interaction features, algorithms taking advantage of the social aspect of the data correspond to the needs of the system. Particular attention to the volume of data is necessary as the system evolves. The scalability of the algorithms is important, i.e. their ability to accommodate a volume of data that can increase significantly over time. The system implements a collaborative-filtering algorithm via a search engine. All the content collected via the suggestion engine 302 is stored in a "big data" system (an English expression designating data sets so large that they become difficult to work with using conventional database-management or information-management tools — source: Wikipedia). These data are then indexed and enriched with internal or external metadata. This index subsequently makes it possible to propose the collected data to users without going through the web APIs. Thus the curation process, described here by way of non-limiting example, enriches over time its own bases of not-yet-curated content that may be useful to future curator users 101. In addition to the data-analysis algorithms, it is necessary to filter and sort the contents so as not to present unattractive contents, and to display the contents in the best possible order for the curator user 101. The curator user 101 enters keywords for the system to offer content, so the first way to filter is based on these keywords. Since the system offers content to curator users 101, who then have the choice to select it for publication or to reject it, it is possible, in a subsequent step, to learn the behaviors of the curator users 101 in order to classify and filter the content. The suggestion engine 302 is directly dependent on the web data-collection part. The problem therefore revolves around the volume of extracted data. If the latter is not large enough, the suggestion engine 302 lacks data to analyze, sort and organize. Conversely, data collection must extract the data that matter for the analysis, sorting and organization algorithms to work. The amount of data currently envisioned is 25 million pages scanned per day, which appears ideal for the algorithms. The purpose of the suggestion engine 302 is to take advantage of all the articles extracted from the web, according to various keywords or predetermined criteria, in order to propose them to the curator users 101. For this purpose, in a step 501 (see Figure 5), contents in the form of URLs are retrieved by the server from content sources associated with the keywords entered by the curator user 101 (in step 301). The goal is then to sort, filter and organize them. The suggestion engine 302 implements an algorithm that uses, as primary sort key, the number of keywords detected in the suggestions. If two suggestions have the same number of keywords, the secondary sort key used is the publication date of the content. This algorithm combines the relevance and the freshness of the content. It also makes the sorting comprehensible to users, because all the criteria used are displayed; a sketch of this sort is given below.
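A direct transcription of this two-level sort, assuming each suggestion carries its text and its publication date (the data layout is an assumption):

```python
# Primary key: number of keywords found; secondary key: publication date.
from datetime import datetime

def sort_suggestions(suggestions, keywords):
    def key(s):
        matches = sum(1 for kw in keywords if kw.lower() in s["text"].lower())
        return (matches, s["published"])
    # Most keyword matches first; among equals, most recent first.
    return sorted(suggestions, key=key, reverse=True)

pages = [
    {"text": "big data curation tools", "published": datetime(2016, 11, 2)},
    {"text": "curation of web content with big data", "published": datetime(2016, 11, 5)},
]
print(sort_suggestions(pages, ["curation", "big data"]))
```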
Some data from web articles are essential at content-extraction time to allow a good qualification of said content. These data then form the basis of the algorithms for classifying, sorting and filtering content. Among these data, several pieces of information are sought: the dates of the articles, the author, the illustration image, etc. Regarding the date of the articles, this information is sometimes absent or hidden within the content. The social popularity of the articles is, on the other hand, one of the best clues to the quality of an article, but it is difficult to obtain. This type of metric is the property of the various social networks, and interrogating these social networks intensively (30 million requests per day) requires a substantial architecture. As illustrated in a non-limiting manner in Figure 5, the suggestion engine 302 implements a set of algorithms capable of choosing, classifying and qualifying contents to be proposed to the curator user 101. These algorithms are based on the body of knowledge of the behaviors of the curator users 101 and on all the content that the system possesses or is capable of searching for on the web. They also use the keywords entered by curator users. As we have seen, in a step 301, the curator user 101 determines keywords and/or sources in URL form. In a step 502, the selected keywords are used to browse data domains of the Twitter or Facebook type (registered trademarks), etc., to retrieve URLs of pages relevant to these keywords. In step 501, the system retrieves the contents of the selected web pages and stores them in memory. In a next step 503, the system extracts from these web pages the texts, images and any associated RSS feed addresses. In a step 504, the RSS feeds are stored and feed a database 505. In a step 506, the system loops through the RSS feed URLs to search for the RSS URLs corresponding to the predefined keywords. Then, these RSS feeds are loaded in a step 507. From the texts, images and RSS feeds extracted during step 503, in a step 514, the system indexes and stores suggestion snippets. These suggestions feed a database 515 of suggestions, in the form of URLs associated with the keywords of the search. Storage consists of storing a URL associated with a set of data extracted from the page — date, title, useful content (stripped of decoration), essential images of the page, etc. — as well as metadata helping to qualify it (keywords, etc.). In a step 516, the system searches for the user-selected keywords in the suggestion database 515, which includes all previous suggestions for all users. The data extracted by this search are filtered in order to eliminate the pages already seen by the user during the present search (step 517), and to apply other filters previously defined by said user (step 518). In a step 519, the suggestions are sorted according to predefined criteria, for example by associated date, or by a more complex criterion such as a quality score. These score criteria may come from machine-learning data stored in a database 520. Finally, in a step 521, the system presents the retained, filtered and sorted suggestions to the user in response to his request.
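End to end, steps 501 to 521 amount to the following pipeline. This is a condensed sketch under stated assumptions: storage is modelled with a plain dict, and the crawling helpers (discover_urls, fetch, parse_page) are hypothetical stand-ins for the servers 204 and 205.

```python
# Condensed sketch of the suggestion pipeline (steps 501-521).
def suggest(keywords, seen_urls, user_filter, suggestion_db,
            discover_urls, fetch, parse_page):
    for url in discover_urls(keywords):            # step 502: seed URLs via APIs
        page = fetch(url)                          # step 501: retrieve the page
        text, images, rss_feeds = parse_page(page) # step 503: extract the parts
        suggestion_db[url] = {"text": text, "images": images,
                              "rss": rss_feeds}    # steps 504/514: index and store
    hits = [u for u, d in suggestion_db.items()    # step 516: keyword search
            if any(kw in d["text"] for kw in keywords)]
    hits = [u for u in hits if u not in seen_urls] # step 517: drop pages already seen
    hits = [u for u in hits if user_filter(u)]     # step 518: user-defined filters
    return sorted(hits)                            # step 519: predefined sort criteria
```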
Example of evaluating the distance of a content from an ideal response

The distance to an ideal response to a search, according to the user's search criteria, is composed of static criteria — i.e. criteria invariant in time and independent of the curator user — and dynamic criteria. The first static criterion allowing the evaluation of the distance to an ideal response is a function of the placement, in a content extracted by the suggestion engine 302, of the keywords determined by the curator user 101. The evaluation of this placement corresponds to the position of the keywords in the title, in the body of the page, in the URL of the page, or in the comments of the page. In other words, the evaluation of the placement of the keywords in the content makes it possible to evaluate the relevance of the content with respect to the keywords. Another static discriminant criterion is the language of the content, which is determined automatically and algorithmically. The language of the content should preferably correspond to the language of the curator user 101. The quality of the suggested content is another static criterion. Quality is evaluated on the basis of the volume of content, the variety of vocabulary, and the size and amount of associated artwork. It should be emphasized that the quality of the content is evaluated by an algorithm which is enriched by learning from the user actions. The popularity of the content on social networks is a criterion that evolves over time but does not depend on the curator user 101. All users of the platform, through their interactions with their individual suggestion engine 302, provide information on each extracted content. This information is a correlation between the acceptance (or not) of the suggested content and the intrinsic quality of that content. This correlation is taken into account as a dynamic criterion in the distance to an ideal response. Finally, the interactions of a curator user 101 with his suggestion engine 302 make it possible to establish a proximity between the subject addressed by the curator user 101 and the subject of the suggested content. This criterion evolves over time and is calculated individually for each curator user 101.
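These criteria can be combined into a single distance, for instance as a linear mix. The zone weights, the normalizations and the linear combination below are assumptions; the patent only names the criteria.

```python
# Sketch of a distance-to-ideal-response; lower is better.
PLACEMENT_WEIGHTS = {"title": 3.0, "url": 2.0, "body": 1.0, "comment": 0.5}

def distance_to_ideal(content, criteria, popularity, topic_proximity):
    """content maps zone name -> text, plus 'lang'; last two args are dynamic."""
    placement = sum(weight for zone, weight in PLACEMENT_WEIGHTS.items()
                    if any(kw in content.get(zone, "") for kw in criteria["keywords"]))
    language = 0.0 if content.get("lang") == criteria["lang"] else 1.0  # static
    quality = min(len(content.get("body", "")) / 5000.0, 1.0)          # static
    static = language - placement - quality
    dynamic = -popularity - topic_proximity                            # dynamic
    return static + dynamic
```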
Operation of the sharing module and display-calendar optimization module (Figure 6)

Curator users 101 of the system wish their ecosystem — defined as the set of reader users 106 who regularly follow their publications, their referencing, and their reactions on social networks — to take the best possible advantage of the articles they have selected. Simple publication on a page of the system is thus not enough; the goal is to share these articles on the different social networks. However, shares at random time intervals are not effective, for example in terms of the number of pages viewed. The objective of the system is then to share the selected documents in time slots automatically adapted to the audience of the curator user 101. For this, the system can (in step 308, seen above) rely on all the existing data, both at the level of the curator user 101 and at the level of the set of users, to determine, for each audience/theme/social-network triplet, a set of publication times and intervals considered most effective according to a predetermined criterion, for example the number of pages viewed. The goal is to build algorithms to create a system that can automatically select the best moments to share content, based on the user's audience, the user's theme and the social network concerned. Some time slots are commonly accepted as better than others, depending on the social network. To refine the best moment for sharing, it is proposed to include the reactions generated on a given social network, and to suggest slots to the curator users 101. In one variant, the system repeats the same information several times on each of the social networks. This repetition is not random but is also based on analyses of the results of the shares. This method makes it possible to reach more people, even though the mass of information on social networks does not allow a reader to consult all information flows constantly. A first challenge is the ability of the system to know the "reactions" generated by user sharing, in order to ultimately determine the best time slots for each user. A second challenge is to establish rules based on the themes (categories) of the content. The goal is initially to manage preferred scheduling hours for sharing on social networks. These preferred times are statically generated at first, and may subsequently be modified for each curator user 101 / topic pair. In the implementation mode described here, preferential hours for each social network are determined beforehand, and the sharing is done primarily at these preferential hours, according to the time zone defined by the topic administrator. The next step is then to extract, for each shared article, the number of reactions it generates. For this, the APIs of the different social networks are used. It is also possible to use the number of views of the article, based on the referrer ("referer") of the browser of the user 101. It is thus possible to calculate a score for each share (for example, the number of reactions on social networks plus the number of views). This score represents the success of a share. By computing the average success per hour, it is possible to define the times most conducive to sharing on each of the social networks, as sketched below. With the knowledge of the amplification of sharing on social networks obtained by analyzing the shares made from the system, and with the objectives set by the users, the display-calendar optimization module allows the curator user 101 to share his contents at the optimal moment, so that they receive the best possible echo on the social networks. Figure 6 details the operation of the sharing module and display-calendar optimization module. As seen in this figure, when, in a step 601, the curator user 101 decides to share a content, he is asked in a step 602 whether he wishes an immediate share or not. If he wishes an immediate share, the content is shared in a step 603 and published on the web (step 604). Simultaneously, the details of this share (date, time, content, destination, etc.) are stored in a step 605, and in a step 606, the system analyzes the impact obtained by the share made, so as to feed a database for machine learning. The impact obtained can be measured by the number of views of the page or by other parameters (length of stay on the page, quotes, etc.). If the curator user 101 does not require an immediate share of the content, in a step 610, he chooses whether or not the date is based on a result objective. If he does not want goal-based scheduling, in a step 611 he chooses a share date, and in a step 612 the content to be shared is added to a queue, in view of its publication on the chosen date. If the curator user 101 chooses, during step 610, a date based on a result objective — a preferred sharing date having been previously defined in a step 613 — the system determines, in a step 614, the date (or set of dates) best suited to the goal of maximizing the result, and then adds the content to be published to the publication queue, associated with this or these calculated sharing dates.
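The per-hour averaging just described might look as follows. The success formula (reactions plus views) follows the text above; the share-record layout is an assumption.

```python
# Sketch: average share success per hour of day, best hours first.
from collections import defaultdict

def best_sharing_hours(shares, top=3):
    """shares: list of dicts with 'hour' (0-23), 'reactions', 'views'."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s in shares:
        totals[s["hour"]] += s["reactions"] + s["views"]  # success of one share
        counts[s["hour"]] += 1
    averages = {h: totals[h] / counts[h] for h in totals}
    return sorted(averages, key=averages.get, reverse=True)[:top]

shares = [{"hour": 9, "reactions": 12, "views": 300},
          {"hour": 9, "reactions": 4, "views": 120},
          {"hour": 21, "reactions": 30, "views": 900}]
print(best_sharing_hours(shares))  # e.g. [21, 9]
```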
Operation of the recommendation and classification system (Figure 7)

This system, based on the behaviors and actions of all the curator users 101 and reader users 106, makes it possible to highlight contents, to categorize them in thematic groups, and to connect the reader users who have the same interests. Figure 7 illustrates its operation in detail. When any user of the system reads a content, shares a content, or qualifies a content with a "like" marker, the system records, in a step 409, those user actions, attached to said content in order to qualify it. The system then calculates, in a step 410, a score for the content, based on these various elements of appreciation collected over time. According to the result of a step 411 of comparing the score with predefined threshold values: if the score of the content is less than a first predetermined threshold value but greater than a second predetermined threshold value, the system calculates (step 703) three categories relevant for the content, and proposes (step 704) that the user choose a category himself, this user's choice being used to improve, in a step 705, the automatic category-selection engine. It is clear that the number of calculated categories may be greater or smaller than three in alternative embodiments of the present method. Finally, in a step 706, the content is used for recommendations. In other words, once the content is categorized, it is displayed in a dedicated part of the site, in its category, with its ranking relative to the other contents in this section. If the score of the content is greater than the first threshold value, it is used for recommendations and, if it has no categories attached, the system calculates, as before, three categories that will come to qualify it. If the score of the content is less than the second threshold value, it is not used for content recommendations.
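This two-threshold routing reads directly as code. The threshold values and the helper callables below are assumptions; the branch structure follows the text.

```python
# Sketch of the routing of steps 410-411 and 703-706; thresholds illustrative.
HIGH, LOW = 100.0, 20.0

def route_content(content, score, compute_categories, ask_user):
    if score > HIGH:                                  # case ii): high score
        if not content.get("categories"):
            content["categories"] = compute_categories(content)
        return "recommend"
    if score > LOW:                                   # case i): middle band
        proposed = compute_categories(content)        # step 703
        content["categories"] = [ask_user(proposed)]  # step 704 (feeds step 705)
        return "recommend"                            # step 706
    return "ignore"                                   # below the second threshold
```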
Data storage

The system must be capable of storing the large volumes of data corresponding to the pages indexed during the reading step. The goal, in number of articles stored, is 30 million per day, or about 30% of the articles browsed and analyzed. Moreover, for each article the user can choose an image to associate with it. For this purpose, the system must also be able to store a large number of images and be able to serve them very quickly. In a context of high volume, particular care is taken to be able to read, efficiently and quickly, the articles to be proposed to a user. For storing articles extracted from the web, relational databases are hardly compatible with intensive data writing. In recent years, non-relational (NoSQL) databases have emerged; some of them are particularly suitable for intensive data writing. The system must be able to store more than 30 million articles per day while serving these contents to users in real time. The articles must also be stored in a sufficiently efficient manner that they can be quickly provided to users. This raises a concurrency problem between reading and writing: the system must be able to quickly write a large amount of information of relatively small size (one article), but also to deliver a set of complete articles selected in "batch". In order to store data efficiently, the system uses SQL and NoSQL storage in parallel. Several storage systems were tried for the data extracted from the Internet: an SQL database, abandoned because the data volume is too large, and NoSQL, too slow in reading. The inventors have therefore developed a redundant storage system (GPDB — Grabbed Post Data Base) in which writes are concatenated onto a data file, each read request reading one of the data files in full. This approach makes it possible to benefit from very good write performance without blocking the reading of all the data in a single request. The redundancy of this storage system is ensured by synchronization (rsync) of the file system from a master machine to a secondary server. This synchronization is performed, for example, once a day. However, such replication has several disadvantages in case of loss of the master machine: the data are synchronized only once a day, so they do not take into account the latest additions and deletions, and a manual action is required to declare the second server as the master server. It is desirable to allow real-time synchronization of the data. The inventors have thus chosen, in the present non-limiting example of implementation, to opt for "master-slave" replication. As in MySQL, the master server writes a log file of all the write actions to be performed on the data; the slave server reads the master's log file and executes the operations sequentially.

High availability of the servers and software layers

The software and hardware architecture needs to be analyzed and improved to meet scalability challenges. The high availability of a web-service platform depends primarily on the architecture implemented, as well as on the ability of the software layers to respond effectively. From an architecture point of view, an LVS cluster provides load balancing. It is then essential to provide a second level of load balancing toward the application servers; this second layer can be provided by Apache servers. From an application point of view, two factors are essential to enable high application availability. First of all, making the application servers stateless makes it possible to avoid synchronization problems between application layers, but also to support the shutdown of one of the application machines (maintenance or breakdown). Secondly, the database is often the single point of failure. It must therefore be possible to use master-slave or master-master replication techniques, clustering, or block-to-block replication of the file system hosting the database. Finally, the reduction of page-loading time can be pursued along two axes: the optimization of the queries performed, and a layer of distributed caches, for example Memcached. The system aims to achieve a high level of visibility on the web and a large number of pages delivered (typically 15 million page views per month). The problem is therefore to implement software and hardware layers able to absorb peak loads while ensuring a constant response time. The inventors have developed a caching technique that ensures the integrity of the data. For this purpose, three types of operations can lead to the writing of a resource into the cache: a reading of the resource from the database, a writing of the resource into the database, and a deletion of the resource from the database. Each operation is assigned a priority. If two cache writes occur "at the same time", the write that comes from the operation with the highest priority wins. For example, if two servers access the same resource at the same time, one modifying the resource and the other reading it, only the version of the resource coming from the modification must end up cached. The same applies to the deletion of a resource: the deletion being final, no further caching of the resource should be possible after the deletion. For this, the system caches a "tombstone" in place of the resource, with the highest priority. The management of concurrent cache writes uses a classic compare-and-swap mechanism. Writes are thus guaranteed to be sequential for each resource, and the priority assigned to the reason for each write guarantees the integrity of the cached data with respect to the transactional model.
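The priority rule can be sketched as follows. The priority values and the in-process dict standing in for the distributed cache are assumptions; in a real Memcached deployment, the final store would use the gets/cas pair rather than a plain assignment.

```python
# Sketch of a priority-tagged cache write with tombstones on delete.
READ, WRITE, DELETE = 1, 2, 3        # delete wins over write wins over read

cache = {}                           # key -> (priority, value); stand-in store

def cache_store(key, value, priority):
    current = cache.get(key)         # in Memcached this would be a "gets"
    if current is not None and current[0] > priority:
        return False                 # a higher-priority write already won
    entry = (priority, None if priority == DELETE else value)  # tombstone on delete
    cache[key] = entry               # real code would "cas" against the version
    return True                      # read by the "gets" above
```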
SEO optimization

Search engines constantly update their algorithms to adapt to new Internet usages and to improve the relevance of their results, and these algorithms remain secret. At the content-structure level, the system adds HTML5 semantic tags in the source code of the pages. These tags — <article>, <header>, <nav>, <footer>, etc. — allow search engines to identify more easily the structure of the pages and the contents that are put forward. De-duplication of pages that share the same URL because they are loaded via Ajax (for example, tabbed browsing) will also be implemented: the use of the pushState function allows search engines to distinguish between these pages, thus increasing the number of indexed pages.

Advantages

The system described above makes it possible to discover, suggest and organize web articles based on the interests of users, and to offer users increased visibility on the web. The system is a platform that must constantly evolve and be constantly enriched with new content; it must constantly be able to offer content ever closer to the interests of each user, while remaining fast and able to serve an audience in constant growth. In addition, the system makes it possible to position the user as a thought leader in his theme. For this, the system must allow sharing at the right time on the social networks.
Claims (11)

1. A method of suggesting content extracted from a set of information sources, characterized in that the method comprises steps of: 301, determination by a curator user of the keywords and/or search sources to be used, thus defining search criteria of the user; 302, enrichment of the user's search criteria with keywords and/or sources suggested by a suggestion engine, thus defining a search strategy; extraction of at least one content from the set of sources of the search strategy according to the keywords of the search strategy; sorting of the contents according to a distance with respect to the search criteria of the user; display of the sorted contents to the curator user; 303, selection by the curator user of the content deemed relevant; 304, publication of the contents selected by the curator user for reader users; 305, recording of the reactions of curator users and reader users for each published content; 306, automatic modification of the suggestion parameters of the suggestion engine based on the suggestion engine's analysis of the recorded reactions, thereby creating a feedback and learning loop of said suggestion engine.

2. A method according to claim 1, further comprising the following step: 308, determination of one or more publication sites, and publication dates, according to previously determined criteria for maximizing the visibility of the publication thus produced.

3. A method according to any one of claims 1 to 2, comprising: a step 406 of calculating a number of views for each page viewed by one, some or all of the curator users (101) and/or reader users (106); a step 410 of calculating a score for each content, according to the number of pages viewed and the type of action performed by a reader user (106) on this content, the type of action being included in the list comprising: follow 407, share 408, recommend 409; a step 411 of score analysis to recommend and categorize the contents for qualification by the curator users (101) and/or reader users (106).

4. The method of claim 3, wherein the categorization is performed using at least one automatic machine-learning algorithm.

5. The method of any one of claims 1 to 4, wherein the suggestion engine implements: a step 502, wherein the keywords selected in step 301 are used to browse data domains to extract URLs of pages relevant with respect to these keywords; a step 501, in which the system retrieves the contents of the selected web pages and stores them in memory; a step 503, in which the system extracts from these web pages the texts, images and any associated RSS feed addresses; a step 504, in which these RSS feeds are stored and feed a database (505); a step 506 of looping over the RSS feed URLs to search for the RSS URLs corresponding to the predefined keywords; a step 507 of loading these RSS feeds; a step 514 in which, from the texts, images and RSS feeds extracted in step 503, the system indexes and stores suggestion elements, these suggestions feeding a database (515) of suggestions; a step 516, in which the system searches the suggestions database for the keywords selected by the user; a step 517 of filtering the extracted data according to this search, to eliminate the pages already seen by the user during a present search; a step 518 of applying other filters previously defined by said user; a step 519, in which the suggestions are sorted according to predefined criteria.
6. A method according to any one of claims 1 to 5, wherein, in step 308, the system relies on the set of data already existing both at the level of the curator user (101) and at the level of the set of curator users (101) and/or reader users (106), to determine, for each audience/theme/publication-network triplet, a set of publication moments and intervals considered most effective according to a predetermined criterion.

7. The method of claim 6, further comprising a step in which the system extracts, for each shared article, the number of reactions it generates, calculates a score for each share, and deduces from it the times most conducive to sharing on each of the social networks.

8. The method of claim 7, comprising a step 605 in which the details of each share are stored, the details including the date, time, content and destination, and a step 606 in which the system analyzes the impact achieved by the share made, so as to feed a database for machine learning.

9. The method according to any one of claims 1 to 8, wherein step 306 implements an algorithm based on the behaviors and actions of all curator users (101) and/or reader users (106), to highlight contents, to categorize them in thematic groups, and to connect users who have the same interests.

10. The method according to claim 9, comprising: a step 409 of recording the actions of each reader user (106) — reading a content, sharing a content, qualifying a content, or recommending a content — attached to said content in order to qualify it; a step 410 of calculating a score for the content, on the basis of these qualification elements collected over time; a step 411 of comparing the score with predefined threshold values: i) if this score of the content is less than a first predetermined threshold value but greater than a second predetermined threshold value, a step 703 of calculating categories that are relevant for the content, a step 704 of proposing to the user to choose a category himself, this user's choice being used to improve, in a step 705, the automatic category-selection engine, and a step 706 of using the content for recommendations; ii) if this score of the content is greater than the first threshold value, the content is used for recommendations and, if it does not have categories attached, the system calculates, as before, categories that will come to qualify it.

11. A computer system for suggesting content extracted from a set of information sources, executing a computer program product implementing a method of suggesting content extracted from a set of information sources according to any one of claims 1 to 10.